Abstract

We address the problem of automati- 
cally acquiring case frame patterns
(selectional patterns) from large corpus
data. In particular, we propose a method 
of learning dependencies between case 
frame slots. We view the problem of 
learning case frame patterns as that of 
learning multi-dimensional discrete joint 
distributions, where random variables 
represent case slots. We then formal- 
ize the dependencies between case slots 
as the probabilistic dependencies between 
these random variables. Since the num- 
ber of parameters in a multi-dimensional 
joint distribution is exponential, it is in- 
feasible to accurately estimate them in 
practice. To overcome this difficulty, 
we settle for approximating the target
joint distribution by the product of low 
order component distributions, based on 
corpus data. In particular we propose 
to employ an efficient learning algorithm 
based on the MDL principle to realize 
this task. Our experimental results in- 
dicate that for certain classes of verbs, 
the accuracy achieved in a disambigua- 
tion experiment is improved by using the 
acquired knowledge of dependencies. 

1 Introduction 

We address the problem of automatically acquir- 
ing case frame patterns (selectional patterns) from 
large corpus data. The acquisition of case frame 
patterns normally involves the following three 
subproblems: 1) extracting case frames from cor- 
pus data, 2) generalizing case frame slots within 
the case frames, 3) learning dependencies that exist
between case frame slots.

In this paper, we propose a method of learn- 
ing dependencies between case frame slots. By 
'dependency' is meant the relation that exists be- 
tween case slots which constrains the possible val- 
ues assumed by each of those case slots. As illus- 
trative examples, consider the following sentences. 

The girl will fly a jet. (1) 

This airline company flies many jets. (2) 
The girl will fly Japan Airlines. (3) 
*The airline company will fly Japan Airlines. (4)

We see that an 'airline company' can be the subject
of verb 'fly' (the value of slot 'arg1'), when
the direct object (the value of slot 'arg2') is an
'airplane' but not when it is an 'airline company.'
These examples indicate that the possible values 
of case slots depend in general on those of the 
other case slots: that is, there exist 'dependen- 
cies' between different case slots. 

The knowledge of such dependencies is useful in 
various tasks in natural language processing, es- 
pecially in analysis of sentences involving multiple 
prepositional phrases, such as 

The girl will fly a jet from Tokyo to Beijing. (5)

Note in the above example that the slot of 'from' 
and that of 'to' should be considered dependent 
and the attachment site of one of the prepositional 
phrases (case slots) can be determined by that of 
the other with high accuracy and confidence. 

There has been no method proposed to date, 
however, that learns dependencies between case 
slots in the natural language processing literature. 
In the past research, the distributional pattern of 
each case slot is learned independently, and meth- 
ods of resolving ambiguities are also based on the 
and Brooks, 1995). Thus, provision of an effective
method of learning dependencies between case
slots, as well as investigation of the usefulness of 
the acquired dependencies in disambiguation and 
other natural language processing tasks would be 
an important contribution to the field. 

In this paper, we view the problem of learn- 
ing case frame patterns as that of learning multi- 
dimensional discrete joint distributions, where 
random variables represent case slots. We then 
formalize the dependencies between case slots as 
the probabilistic dependencies between these ran- 
dom variables. Since the number of parameters 
that exist in a multi-dimensional joint distribution 
is exponential if we allow n-ary dependencies, it is 
infeasible to accurately estimate them with a data 
size available in practice. It is also clear that rel- 
atively few of these random variables (case slots) 
are actually dependent on each other with any sig- 
nificance. Thus it is likely that the target joint 
distribution can be approximated reasonably well 
by the product of component distributions of low 
order, drastically reducing the number of param- 
eters that need to be considered. This is indeed 
the approach we take in this paper. 

Now the problem is how to approximate a joint distribution by the
product of low order component distributions. Recently, Suzuki
proposed an algorithm to approximately learn a multi-dimensional
discrete joint distribution expressible as a 'dendroid distribution,'
which is both efficient and theoretically sound (Suzuki, 1993). We
employ Suzuki's algorithm to learn case frame patterns as dendroid
distributions. We conducted some experiments to automatically acquire
case frame patterns from the Penn Tree Bank bracketed corpus. Our
experimental results indicate that for some classes of verbs the
accuracy achieved in a disambiguation experiment can be improved by
using the acquired knowledge of dependencies between case slots.

2 Probabilistic Models for Case 
Frame Patterns 



(fly (arg1 girl) (arg2 jet))
(fly (arg1 company) (arg2 jet))
(fly (arg1 girl) (arg2 company))

Figure 1: Example case frames generated by a word-based model



Suppose that we have data of the type shown in Figure 1, given by
case frame instances of verb 'fly' automatically extracted from a
corpus, using conventional techniques. As explained in the
Introduction, the problem of learning case frame patterns can be
viewed as that of estimating the underlying multi-dimensional joint
discrete distributions which give rise to such data. In this
research, we assume that case frame instances with the same head are
generated by a joint distribution of type

    $P_Y(X_1, X_2, \ldots, X_n)$,    (6)

where index $Y$ stands for the head, and each of the random variables
$X_i$, $i = 1, 2, \ldots, n$, represents a case slot. In this paper,
we use 'case slots' to mean surface case slots, and we uniformly
treat obligatory cases and optional cases. Thus the number $n$ of the
random variables is roughly equal to the number of prepositions in
English (and less than 100).

These models can be further classified into three types of
probabilistic models according to the type of values each random
variable $X_i$ assumes. When $X_i$ assumes a word or a special symbol
'0' as its value, we refer to the corresponding model
$P_Y(X_1, \ldots, X_n)$ as a 'word-based model.' Here '0' indicates
the absence of the case slot in question. When $X_i$ assumes a
word-class or '0' as its value, the corresponding model is called a
'class-based model.' When $X_i$ takes on 1 or 0 as its value, we call
the model a 'slot-based model.' Here the value '1' indicates the
presence of the case slot in question, and '0' its absence. For
example, the data in Figure 1 can be generated by a word-based model,
and the data in Figure 2 by a class-based model. Suppose for
simplicity that there are only 4 possible case slots corresponding
respectively to the subject, direct object, 'from' phrase, and 'to'
phrase. Then,

    $P_{\mathrm{fly}}(X_{\mathrm{arg1}} = \mathrm{girl}, X_{\mathrm{arg2}} = \mathrm{jet}, X_{\mathrm{from}} = 0, X_{\mathrm{to}} = 0)$    (7)

is given a specific probability value by a word-based model. In
contrast,

    $P_{\mathrm{fly}}(X_{\mathrm{arg1}} = \langle\mathrm{person}\rangle, X_{\mathrm{arg2}} = \langle\mathrm{airplane}\rangle, X_{\mathrm{from}} = 0, X_{\mathrm{to}} = 0)$    (8)

is given a specific probability value by a class-based model, where
<person> and <airplane> denote word classes. Finally,

    $P_{\mathrm{fly}}(X_{\mathrm{arg1}} = 1, X_{\mathrm{arg2}} = 1, X_{\mathrm{from}} = 0, X_{\mathrm{to}} = 0)$    (9)

is assigned a specific probability value by a slot-based model. We
then formulate the dependencies between case slots as the
probabilistic dependencies between the random variables in each of
these three models.

< . . . > : word class

(fly (arg1 <person>) (arg2 <airplane>))
(fly (arg1 <person>) (arg2 <airplane>))
(fly (arg1 <person>) (arg2 <airplane>))
(fly (arg1 <company>) (arg2 <airplane>))
(fly (arg1 <company>) (arg2 <airplane>))
(fly (arg1 <person>) (arg2 <company>))
(fly (arg1 <person>) (to <place>))
(fly (arg1 <person>) (from <place>) (to <place>))
(fly (arg1 <company>) (from <place>) (to <place>))

Figure 2: Example case frames generated by a class-based model
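The three value types can be illustrated with a minimal sketch. The slot list follows the four-slot simplification used above; the word-to-class table `WORD_CLASS` and the function names are hypothetical and serve only to show how one case frame instance maps to the values assumed by $X_i$ under each model.

```python
# Slots of the simplified four-slot setting used in the text.
SLOTS = ["arg1", "arg2", "from", "to"]

# Hypothetical word-to-class table, for illustration only.
WORD_CLASS = {
    "girl": "<person>",
    "jet": "<airplane>",
    "company": "<company>",
    "Tokyo": "<place>",
}


def word_based(frame):
    """Each slot takes a word, or 0 when the slot is absent."""
    return {slot: frame.get(slot, 0) for slot in SLOTS}


def class_based(frame):
    """Each slot takes a word class, or 0 when the slot is absent."""
    return {slot: WORD_CLASS[frame[slot]] if slot in frame else 0
            for slot in SLOTS}


def slot_based(frame):
    """Each slot takes 1 when present and 0 when absent."""
    return {slot: 1 if slot in frame else 0 for slot in SLOTS}


frame = {"arg1": "girl", "arg2": "jet"}  # (fly (arg1 girl) (arg2 jet))
print(word_based(frame))
print(class_based(frame))
print(slot_based(frame))
```

The three dictionaries correspond to the events whose probabilities appear in (7), (8), and (9) respectively.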

In the absence of any constraints, however, the number of parameters
in each of the above three models is exponential (even the slot-based
model has $O(2^n)$ parameters), and thus it is infeasible to
accurately estimate them in practice. An assumption that is often
made to deal with this difficulty is that random variables (case
slots) are mutually independent.

Suppose for example that in the analysis of the sentence

    The girl will fly a jet from Tokyo, (10)

the following alternative interpretations are given,

    (fly (arg1 girl) (arg2 jet) (from Tokyo)) (11)
    (fly (arg1 girl) (arg2 (jet (from Tokyo)))) (12)

We wish to select the more appropriate of the two interpretations. A
heuristic word-based method for disambiguation, in which the random
variables (case slots) are assumed to be dependent, is to calculate
the following values of word-based likelihood and to select the
interpretation with the higher likelihood value.

    $P_{\mathrm{fly}}(X_{\mathrm{arg1}} = \mathrm{girl}, X_{\mathrm{arg2}} = \mathrm{jet}, X_{\mathrm{from}} = \mathrm{Tokyo})$    (13)

    $P_{\mathrm{fly}}(X_{\mathrm{arg1}} = \mathrm{girl}, X_{\mathrm{arg2}} = \mathrm{jet}) \times P_{\mathrm{jet}}(X_{\mathrm{from}} = \mathrm{Tokyo})$    (14)

If on the other hand we assume that the random variables are
independent, we only need to calculate and compare (cf. (Li and Abe,
1995))

    $P_{\mathrm{fly}}(X_{\mathrm{from}} = \mathrm{Tokyo})$    (15)

and

    $P_{\mathrm{jet}}(X_{\mathrm{from}} = \mathrm{Tokyo})$.    (16)

The independence assumption can also be made in the case of a
class-based model or a slot-based model. For slot-based models, with
the independence assumption, the following probabilities

    $P_{\mathrm{fly}}(X_{\mathrm{from}} = 1)$    (17)
    $P_{\mathrm{jet}}(X_{\mathrm{from}} = 1)$    (18)

are to be compared (cf. (Hindle and Rooth, 1991)).



Assuming that random variables (case slots) are mutually independent
would drastically reduce the number of parameters. (Note that under
the independence assumption the number of parameters in a slot-based
model becomes $O(n)$.) As illustrated in Section 1, however, this
assumption is not necessarily valid in practice.
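To make the reduction concrete, the following sketch counts free parameters for $n$ variables with $k$ values each under three assumptions: an unrestricted joint distribution, full independence, and a single dependency tree in which each non-root variable depends on exactly one other. Counting free parameters (one fewer than the number of probabilities per distribution) is a convention assumed here; the text itself states only orders of magnitude, and the function names are illustrative.

```python
def full_joint_params(n, k=2):
    """Unrestricted joint over n variables with k values each."""
    return k ** n - 1


def independent_params(n, k=2):
    """All variables mutually independent: one marginal per variable."""
    return n * (k - 1)


def tree_params(n, k=2):
    """One root marginal plus one conditional table per remaining variable."""
    return (k - 1) + (n - 1) * k * (k - 1)


for n in (4, 10, 20):
    print(n, full_joint_params(n), independent_params(n), tree_params(n))
```

For a slot-based model ($k = 2$) the tree-structured count grows linearly in $n$, whereas the unrestricted count doubles with every added slot.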

What seems to be true in practice is that some case slots are in fact
dependent but the overwhelming majority of them are independent, due
partly to the fact that usually only a few case slots are obligatory
and most others are optional.^1 Thus the target joint distribution is
likely to be approximable by the product of several component
distributions of low order, and thus to have in fact a reasonably
small number of parameters. We are thus led to the approach of
approximating the target joint distribution by such a simplified
model, based on corpus data.

3 Approximation by Dendroid 
Distribution 

Without loss of generality, any n-dimensional joint distribution can
be written as

    $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_{m_i} \mid X_{m_1}, \ldots, X_{m_{i-1}})$    (19)

for some permutation $(m_1, m_2, \ldots, m_n)$ of $1, 2, \ldots, n$,
where we let $P(X_{m_1} \mid X_{m_0})$ denote $P(X_{m_1})$.

1 Optional case slots are not necessarily independent, but if two
optional case slots are randomly selected, it is likely that they are
independent of one another.

A plausible assumption on the dependencies between random variables
is intuitively that each variable directly depends on at most one
other variable. (Note that this assumption is the simplest among
those that relax the independence assumption.) For example, if a
joint distribution $P(X_1, X_2, X_3)$ over 3 random variables
$X_1, X_2, X_3$ can be written (approximated) as follows, it
(approximately) satisfies such an assumption.

    $P(X_1, X_2, X_3) = (\approx)\ P(X_1) \times P(X_2 \mid X_1) \times P(X_3 \mid X_1)$    (20)

Such distributions are referred to as 'dendroid distributions' in the
literature. A dendroid distribution can be represented by a
dependency forest (i.e. a set of dependency trees), whose nodes
represent the random variables, and whose directed arcs represent the
dependencies that exist between these random variables, each labeled
with a number of parameters specifying the probabilistic dependency.
(A dendroid distribution is a restricted form of the Bayesian network
(Pearl, 1988).) It is not difficult to see that there are 7 and only
7 such representations for the joint distribution $P(X_1, X_2, X_3)$
(see Figure 3), disregarding the actual numerical values of the
probabilistic parameters.
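The count of 7 can be checked by brute force. The sketch below enumerates every set of undirected links over the variables and keeps those without a loop. Treating different arc orientations of the same tree as one representation is an assumption made here, on the grounds that they describe the same family of factorizations; the function names are illustrative.

```python
from itertools import combinations


def is_forest(n, edges):
    """Return True if the undirected edges over n nodes contain no loop."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # a and b are already connected: loop
        parent[root_a] = root_b
    return True


def dependency_forests(n):
    """All loop-free sets of links over n nodes (labeled forests)."""
    all_edges = list(combinations(range(n), 2))
    return [subset
            for size in range(len(all_edges) + 1)
            for subset in combinations(all_edges, size)
            if is_forest(n, subset)]


print(len(dependency_forests(3)))  # 7
```

For three variables the only excluded link set is the full triangle, which leaves the empty forest, three single-link forests, and three two-link chains.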

Now we turn to the problem of how to select the best dendroid
distribution from among all possible ones to approximate a target
joint distribution based on input data 'generated' by it. This
problem has been investigated in the area of machine learning and
related fields. A classical method is Chow & Liu's algorithm for
estimating a multi-dimensional joint distribution as a dependency
tree, in a way which is both efficient and theoretically sound (Chow
and Liu, 1968). More recently Suzuki extended their algorithm so that
it estimates the target joint distribution as a dendroid distribution
or dependency forest (Suzuki, 1993), allowing for the possibility of
learning one group of random variables to be completely independent
of another. Since many of the random variables (case slots) in a case
frame pattern are essentially independent, this feature is crucial in
our context, and we thus employ Suzuki's algorithm for learning our
case frame patterns.



Suzuki's algorithm first calculates the mutual information between
every pair of nodes (random variables), and it sorts the node pairs
in descending order with respect to the mutual information. It then
puts a link between the node pair with the largest mutual information
value $I$, provided that $I$ exceeds a certain threshold which
depends on the node pair and that adding that link will not create a
loop in the current dependency graph. It repeats this process until
no node pair is left unprocessed. Figure 4 shows the detail of this
algorithm, where $k_i$ denotes the number of possible values assumed
by $X_i$, $N$ the input data size, and log denotes the logarithm to
the base 2. It is easy to see that the number of parameters in a
dendroid distribution is of the order $O(nk^2)$, where $k$ is the
maximum of all $k_i$, and $n$ is the number of random variables. The
time complexity of the algorithm is of the order
$O(n^2(k^2 + \log n))$.
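The procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: mutual information is computed from raw relative frequencies (no smoothing), $k_i$ is taken to be the number of distinct values observed for $X_i$, every pair is examined in turn as in the prose description, and the names `mutual_information` and `learn_dendroid` are hypothetical. The threshold is $(k_i - 1)(k_j - 1)\frac{\log N}{2N}$.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(data, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    left = Counter(row[i] for row in data)
    right = Counter(row[j] for row in data)
    return sum((c / n) * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())


def learn_dendroid(data):
    """Return the links of a dependency forest chosen by the MDL threshold."""
    n_samples, n_vars = len(data), len(data[0])
    # Assumption: k_i is the number of distinct values seen in column i.
    k = [len({row[i] for row in data}) for i in range(n_vars)]

    scored = []
    for i, j in combinations(range(n_vars), 2):
        info = mutual_information(data, i, j)
        theta = (k[i] - 1) * (k[j] - 1) * math.log2(n_samples) / (2 * n_samples)
        scored.append((info, theta, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)

    parent = list(range(n_vars))  # union-find over connected components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for info, theta, i, j in scored:
        if info <= theta:
            continue  # dependency not significant for this data size
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding the link creates no loop
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


# Columns 0 and 1 always agree; column 2 is unrelated to both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 2
print(learn_dendroid(rows))  # [(0, 1)]
```

In the toy data the first two columns carry one bit of mutual information, which exceeds the threshold of 3/16 for $N = 8$, while the third column has zero mutual information with either and stays unlinked.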

We will now show how the algorithm works by an illustrative example.
Suppose that the data is given as in Figure 2 and there are 4 nodes
(random variables) $X_{\mathrm{arg1}}$, $X_{\mathrm{arg2}}$,
$X_{\mathrm{from}}$, $X_{\mathrm{to}}$. The values of mutual
information and thresholds for all node pairs are shown in
Table 1.^2 Based on this calculation the algorithm constructs the
dependency forest shown in Figure 5, because the mutual information
between $X_{\mathrm{arg2}}$ and $X_{\mathrm{to}}$, and between
$X_{\mathrm{from}}$ and $X_{\mathrm{to}}$, is large enough, but not
that of the others. The result indicates that slots 'arg2' and 'from'
should be considered dependent on 'to.' Note that 'arg2' and 'from'
should also be considered dependent via 'to' but to a somewhat weaker
degree.
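The threshold values in this example can be reproduced from the threshold formula of Figure 4, $(k_i - 1)(k_j - 1)\frac{\log N}{2N}$. In the sketch below, $N = 9$ is the number of case frames in Figure 2, and the value counts are read off that figure as an assumption: 'arg1' takes two classes and is never absent, 'arg2' takes two classes or is absent, and 'from' and 'to' are each either <place> or absent. The function name is illustrative.

```python
import math
from itertools import combinations


def mdl_threshold(k_i, k_j, n):
    """Threshold (k_i - 1)(k_j - 1) log2(N) / (2N) for one node pair."""
    return (k_i - 1) * (k_j - 1) * math.log2(n) / (2 * n)


N = 9  # case frames in Figure 2
# Number of values per slot, as read off Figure 2 (assumption).
K = {"arg1": 2, "arg2": 3, "from": 2, "to": 2}

for a, b in combinations(K, 2):
    print(f"{a:>4} - {b:<4}  threshold = {mdl_threshold(K[a], K[b], N):.2f}")

# The same pair needs far less mutual information once more data is seen.
print(mdl_threshold(2, 2, 9), mdl_threshold(2, 2, 900))
```

Pairs involving 'arg2' come out at 0.35 and all other pairs at 0.18, and the last line shows how the threshold shrinks as the data size grows.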



[Figure: dependency forest over the nodes X_arg1, X_arg2, X_from, and X_to]

Figure 5: An example case frame pattern



Suzuki's algorithm is derived from the Minimum Description Length
(MDL) principle (Rissanen, 1978; Rissanen, 1983; Rissanen, 1984;
Rissanen, 1986; Rissanen, 1989), which is a principle for data
compression and estimation from information theory and statistics. It
is known that as a strategy of estimation, MDL is guaranteed to be
near optimal.^3 In applying MDL, we usually assume that the given
data are generated by a probabilistic model that belongs to a certain
class of models and select a model within the class which best
explains the data. It tends to be the case that a simpler model has a
poorer fit to the data, and a more complex model has a better fit to
the data. Thus there is a trade-off between the simplicity of a model
and the goodness of fit to data. MDL resolves this trade-off in a
disciplined way: It selects a model which is reasonably simple and
fits the data satisfactorily as well. In our current problem, a
simple model means a model with fewer dependencies, and thus MDL
provides a theoretically sound way to learn only those dependencies
that are statistically significant in the given data. An especially
interesting feature of MDL is that it incorporates the input data
size in its model selection criterion. This is reflected, in our
case, in the derivation of the threshold $\theta$. Note that when we
do not have enough data (i.e. for small $N$), the thresholds will be
large and few nodes tend to be linked, resulting in a simple model in
which most of the case slots are judged independent. This is
reasonable since with a small data size most case slots cannot be
determined to be dependent with any significance.

2 The probabilities in this table are estimated by using the
so-called Expected Likelihood Estimator, i.e. by adding 0.5 to actual
frequencies (cf. (Gale and Church, 1990)).

3 We refer the interested reader to (Quinlan and Rivest, 1989; Li and
Abe, 1995) for an introduction to MDL.

[Figure: the seven dependency forests over X_1, X_2, X_3, each with its equivalent factorizations]

(1) no links:
    $P(X_1)P(X_2)P(X_3)$
(2) link X_2 - X_3:
    $P(X_1)P(X_2)P(X_3 \mid X_2) = P(X_1)P(X_3)P(X_2 \mid X_3)$
(3) link X_1 - X_2:
    $P(X_1)P(X_2 \mid X_1)P(X_3) = P(X_2)P(X_1 \mid X_2)P(X_3)$
(4) link X_1 - X_3:
    $P(X_1)P(X_3 \mid X_1)P(X_2) = P(X_3)P(X_1 \mid X_3)P(X_2)$
(5) links X_1 - X_2 - X_3:
    $P(X_1)P(X_2 \mid X_1)P(X_3 \mid X_2) = P(X_2)P(X_1 \mid X_2)P(X_3 \mid X_2) = P(X_3)P(X_2 \mid X_3)P(X_1 \mid X_2)$
(6) links X_2 - X_1 - X_3:
    $P(X_2)P(X_1 \mid X_2)P(X_3 \mid X_1) = P(X_1)P(X_3 \mid X_1)P(X_2 \mid X_1) = P(X_3)P(X_1 \mid X_3)P(X_2 \mid X_1)$
(7) links X_1 - X_3 - X_2:
    $P(X_1)P(X_3 \mid X_1)P(X_2 \mid X_3) = P(X_3)P(X_1 \mid X_3)P(X_2 \mid X_3) = P(X_2)P(X_3 \mid X_2)P(X_1 \mid X_3)$

Figure 3: Dendroid distributions

4 Experimental Results 

We conducted some experiments to test the per- 
formance of the proposed method as a method of 
acquiring case frame patterns. In particular, we 
tested to see how effective the patterns acquired 
by our method are in a structural disambiguation 
experiment. We will describe the results of this 
experimentation in this section. 

4.1 Experiment 1: Slot-based Model 

In our first experiment, we tried to acquire case frame patterns as
slot-based models. We extracted 181,250 case frames from the Wall
Street Journal (WSJ) bracketed corpus of the Penn Tree Bank (Marcus
et al., 1993) as training data. There were 357 verbs for which more
than 50 case frame examples appeared in the training data. Table 2
shows the verbs that appeared in the data most frequently and the
numbers of their occurrences.

Algorithm:

1. Let T := {} (the empty set);
2. Calculate the mutual information I(X_i, X_j) for all node pairs (X_i, X_j);
3. Sort the node pairs in descending order of I, and store them into queue Q;
4. Let V be the set of {X_i}, i = 1, 2, ..., n;
5. while the maximum value of I of the node pair (X_i, X_j) in Q satisfies

       $I(X_i, X_j) > \theta(X_i, X_j) = (k_i - 1)(k_j - 1)\frac{\log N}{2N}$

   do
   begin
      5-1. Remove the node pair (X_i, X_j) having the maximum value of I from Q;
      5-2. if X_i and X_j belong to different sets W_1, W_2 in V
           then
              Replace W_1 and W_2 in V with W_1 U W_2, and add edge (X_i, X_j) to T;
   end
6. Output T as the set of edges of the estimated model.

Figure 4: The learning algorithm

Table 1: Mutual information and threshold values for node pairs

I : θ      X_arg1   X_arg2        X_from        X_to
X_arg1              0.01 : 0.35   0.02 : 0.18   0.00 : 0.18
X_arg2                            0.22 : 0.35   0.43 : 0.35
X_from                                          0.26 : 0.18
X_to

Table 2: Verbs appearing most frequently

Verb       Num. of frames
be         17713
say         9840
have        4030
make        1770
take        1245
expect      1201
sell        1147
rise        1125
get         1070
go          1042
do           982
buy          965
fall         862
add          740
come         733
include      707
give         703
pay          700
see          680
report       674

First we acquired the case frame patterns as slot-based models for
all of the 357 verbs. We then conducted a ten-fold cross validation
to evaluate the 'test data perplexities'^4 of the acquired case frame
patterns, that is, we used nine tenths of the case frames for each
verb as training data (saving what remains as test data), to acquire
a case frame pattern for the verb, and then calculated perplexity
using the test data. We repeated this process ten times and
calculated the average perplexity. Table 3 shows the average
perplexities obtained for some randomly selected verbs. We also
calculated the average perplexities of the 'independent slot models'
acquired based on the assumption that each case slot is independent.
Our experimental results shown in Table 3 indicate that the use of
the dendroid models can achieve up to 20% perplexity reduction as
compared to the use of the independent slot models. It seems safe to
say therefore that the dendroid model is more suitable for
representing the true model of case frames than the independent slot
model.

4 The 'test data perplexity,' which is a measure of testing how well
an estimated probabilistic model predicts some hitherto unseen data,
is defined as $2^{H(P_T, P_M)}$, with
$H(P_T, P_M) = -\sum_x P_T(x) \cdot \log P_M(x)$, where $P_M(x)$
denotes the probability function of the estimated model, and $P_T(x)$
the distribution function of the data (Bahl et al., 1983).
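The test data perplexity $2^{H(P_T, P_M)}$ can be computed as in the sketch below. It assumes that $P_T$ is the empirical distribution of the held-out examples and that the estimated model is supplied as a function returning a non-zero probability for every test item; the names `perplexity` and `model_prob` are illustrative.

```python
import math
from collections import Counter


def perplexity(test_data, model_prob):
    """2 ** H(P_T, P_M), with P_T the empirical test distribution."""
    n = len(test_data)
    counts = Counter(test_data)
    cross_entropy = -sum((c / n) * math.log2(model_prob(x))
                         for x, c in counts.items())
    return 2 ** cross_entropy


# Slot-based frames as (arg1, arg2) presence tuples, for illustration.
held_out = [(1, 1), (1, 1), (1, 0), (1, 1)]


def uniform(frame):
    """A model that spreads probability evenly over the four outcomes."""
    return 0.25


print(perplexity(held_out, uniform))  # 4.0
```

A model that is uniform over four outcomes has perplexity 4 on any test set; a model that concentrates probability on the frames that actually occur scores lower.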

We also used the acquired dependency knowledge in a pp-attachment
disambiguation experiment. We used the case frames of all 357 verbs
as our training data. We used the entire bracketed corpus as training
data because we wanted to utilize as much training data as possible.
We extracted (verb, noun1, prep, noun2) and (verb, prep1, noun1,
prep2, noun2) patterns from the WSJ tagged corpus as test data, using
pattern matching techniques such as that described in (Smadja, 1993).
We took care to ensure that only the part of the tagged
(non-bracketed) corpus which does not overlap with the bracketed
corpus is used as test data. (The bracketed corpus does overlap with
part of the tagged corpus.)

We acquired case frame patterns using the training data. Figure 6
shows an example of the results, which is part of the case frame
pattern (dendroid distribution) for the verb 'buy.' Note in the model
that the slots 'for,' 'on,' etc. are dependent on 'arg2,' while
'arg1' and 'from' are independent.
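As an illustration of how such a pattern assigns probabilities, the sketch below evaluates slot-based frames for 'buy' using the values printed in Figure 6. Restricting the model to the six slots shown there is an assumption (the figure gives only part of the pattern), and the names are illustrative.

```python
from itertools import product

# Values copied from Figure 6.
P_ARG1 = {0: 0.000571, 1: 0.999429}
P_ARG2 = {0: 0.055114, 1: 0.944886}
P_FROM = {0: 0.933523, 1: 0.066477}
# P(slot = 1 | arg2 = value) for the slots that depend on 'arg2'.
P_PRESENT_GIVEN_ARG2 = {
    "on": {1: 0.018630, 0: 0.112245},
    "for": {1: 0.109976, 0: 0.255102},
    "by": {1: 0.004207, 0: 0.051020},
}
SLOTS = ("arg1", "arg2", "on", "for", "by", "from")


def frame_probability(frame):
    """Probability of one presence/absence pattern under the dendroid model."""
    arg2 = frame["arg2"]
    p = P_ARG1[frame["arg1"]] * P_ARG2[arg2] * P_FROM[frame["from"]]
    for slot in ("on", "for", "by"):
        p_present = P_PRESENT_GIVEN_ARG2[slot][arg2]
        p *= p_present if frame[slot] == 1 else 1.0 - p_present
    return p


# Frame with subject, direct object, and a 'for' phrase.
example = dict(zip(SLOTS, (1, 1, 0, 1, 0, 0)))
print(frame_probability(example))

# Sanity check: the 64 possible patterns sum to (approximately) one.
total = sum(frame_probability(dict(zip(SLOTS, values)))
            for values in product((0, 1), repeat=len(SLOTS)))
print(total)
```

Each frame probability is a product of one factor per slot, with the factors for 'on,' 'for,' and 'by' looked up according to whether 'arg2' is present.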

We found that there were 266 verbs whose 'arg2' slot is dependent on
some of the other preposition slots. Table 4 shows 37 of the verbs
whose dependencies between arg2 and other case slots are positive and
exceed a certain threshold, i.e.
$P(\mathrm{arg2} = 1, \mathrm{prep} = 1) > 0.25$.^5 The dependencies
found by our method seem to agree with human intuition in most cases.

There were 93 examples in the test data ((verb, noun1, prep, noun2)
pattern) in which the two slots 'arg2' and prep of verb are
determined to be positively dependent and their dependencies are
stronger than the threshold of 0.25. We forcibly attached prep noun2
to verb for these 93 examples. For comparison, we also tested the
disambiguation method based on the independence assumption proposed
by (Li and Abe, 1995) on these examples. Table 5 shows the results of
these experiments, where 'Dendroid' stands for the former method and
'Independent' the latter. We see that using the information on
dependency we can significantly improve the disambiguation accuracy
on this part of the data. Since we can use existing methods to
perform disambiguation for the rest of the data, we can improve the
disambiguation accuracy for the entire test data using this
knowledge.

Table 5: Disambiguation results 1

              Accuracy (%)
Dendroid      90/93 (96.8)
Independent   79/93 (84.9)

Table 7: Disambiguation results 2

              Accuracy (%)
Dendroid      21/21 (100)
Independent   20/21 (95.2)

5 We uniformly treat the head of a noun phrase immediately after a
verb as 'arg2' (including, for example, '30%' in 'rise 30% from
billion').

Furthermore, we found that there were 140 verbs having
inter-dependent preposition slots. Table 6 shows 22 out of these 140
verbs such that their case slots have positive dependency that
exceeds a certain threshold, i.e.
$P(\mathrm{prep}_1 = 1, \mathrm{prep}_2 = 1) > 0.25$. Again the
dependencies found by our method seem to agree with human intuition.

In the test data ((verb, prep1, noun1, prep2, noun2) pattern), there
were 21 examples that involve one of the above 22 verbs whose
preposition slots show dependency exceeding 0.25. We forcibly
attached both prep1 noun1 and prep2 noun2 to verb on these 21
examples, since the two slots prep1 and prep2 are judged dependent.
Table 7 shows the results of this experimentation, where 'Dendroid'
and 'Independent' respectively represent the method of using and not
using the dependencies. Again, we find that for the part of the test
data in which dependency is present, the dependency knowledge can be
used to improve the accuracy of a disambiguation method, although our
experimental results are inconclusive at this stage.

4.2 Experiment 2: Class-based Model 

We also used the 357 verbs and their case frames used in Experiment 1
to acquire case frame patterns as class-based models using the
proposed method. We randomly selected 100 verbs among these 357 verbs
and attempted to acquire their case frame patterns. We generalized
the case slots within each of these case frames using the method
proposed by (Li and Abe, 1995) to obtain class-based case slots, and
then replaced the word-based case slots in the data with the obtained
class-based case slots. What resulted are class-based case frame
examples like those shown in Figure 2. We used these data as input to
the learning algorithm and acquired a case frame pattern for each of
the 100 verbs. We found that no two case slots are determined as
dependent in any of the case frame patterns. This is because the
number of parameters in a class-based model is very large compared to
the size of the data we had available.

Table 3: Verbs and their perplexities

Verb        Independent   Dendroid (Reduction in percentage)
add          5.82          5.36 (8%)
buy          5.04          4.98 (1%)
find         2.07          1.92 (7%)
open        20.56         16.53 (20%)
protect      3.39          3.13 (8%)
provide      4.46          4.13 (7%)
represent    1.26          1.26 (0%)
send         3.20          3.29 (-3%)
succeed      2.97          2.57 (13%)
tell         1.36          1.36 (0%)

buy:
[arg1]: [P(arg1=0)=0.000571] [P(arg1=1)=0.999429]
[arg2]: [P(arg2=0)=0.055114] [P(arg2=1)=0.944886]
  [P(on=1|arg2=1)=0.018630] [P(on=0|arg2=1)=0.981370]
  [P(on=1|arg2=0)=0.112245] [P(on=0|arg2=0)=0.887755]
  [P(for=1|arg2=1)=0.109976] [P(for=0|arg2=1)=0.890024]
  [P(for=1|arg2=0)=0.255102] [P(for=0|arg2=0)=0.744898]
  [P(by=1|arg2=1)=0.004207] [P(by=0|arg2=1)=0.995793]
  [P(by=1|arg2=0)=0.051020] [P(by=0|arg2=0)=0.948980]
[on]: [P(on=0)=0.976705] [P(on=1)=0.023295]
[for]: [P(for=0)=0.882386] [P(for=1)=0.117614]
[by]: [P(by=0)=0.993750] [P(by=1)=0.006250]
[from]: [P(from=0)=0.933523] [P(from=1)=0.066477]

Figure 6: An example case frame pattern (dendroid distribution)

Our experimental result verifies the validity in 
practice of the assumption widely made in statis- 
tical natural language processing that class-based 
case slots (and also word-based case slots) are mu- 
tually independent, at least when the data size 
available is that provided by the current version 
of the Penn Tree Bank. This is an empirical find- 
ing that is worth noting, since up to now the inde- 
pendence assumption was based solely on human 
intuition, to the best of our knowledge. 

Table 4: Verbs and their dependent case slots

Verb        Dependent slots   Example
achieve     arg2 in           achieve breakthrough in 1987
acquire     arg2 in           acquire share in market
add         arg2 to           add 1 to 3
begin       arg2 in           begin proceeding in London
blame       arg2 for          blame school for limitation
buy         arg2 for          buy property for cash
charge      arg2 in           charge man in court
climb       arg2 from         climb 20% from million
compare     arg2 with         compare profit with estimate
convert     arg2 to           convert share to cash
defend      arg2 against      defend themselves against takeover
earn        arg2 on           earn billion on revenue
end         arg2 at           end day at 778
explain     arg2 to           explain it to colleague
fall        arg2 in           fall 45% in 1977
file        arg2 against      file suit against company
file        arg2 with         file issue with commission
finish      arg2 at           finish point at 22
focus       arg2 on           focus attention on value
give        arg2 to           give business to firm
increase    arg2 to           increase number to five
invest      arg2 in           invest share in fund
negotiate   arg2 with         negotiate rate with advertiser
open        arg2 in           open bureau in capital
pay         arg2 for          pay million for service
play        arg2 in           play role in takeover
prepare     arg2 for          prepare case for trial
provide     arg2 for          provide engine for plane
pull        arg2 from         pull money from market
refer       arg2 to           refer inquiry to official
return      arg2 to           return car to dealer
rise        arg2 from         rise 10% from billion
spend       arg2 on           spend money on production
surge       arg2 in           surge 10% in 1988
surge       arg2 to           surge 25% to million
trade       arg2 in           trade stock in transaction
turn        arg2 to           turn ball to him
withdraw    arg2 from         withdraw application from office

[Figure: two plots, each with curves for "model1", "model2", and "model3"]

Figure 7: (a) Number of dependencies versus data size and (b) KL distance versus data size

To test how large a data size is required to estimate a class-based
model, we conducted the following experiment. We defined an
artificial class-based model and generated some data according to its
distribution. We then used the data to estimate a class-based model
(dendroid distribution), and evaluated the estimated model by
measuring the number of dependencies (dependency arcs) it has and the
KL distance^6 between the estimated model and the true model. We
repeatedly generated data and observed the 'learning curve,' namely
the relationship between the number of dependencies in the estimated
model and the data size used in estimation, and the relationship
between the KL distance between the estimated and true model and the
data size. We defined two other models and conducted the same
experiments. Figure 7 shows the results of these experiments for
these three artificial models averaged over 10 trials. (The numbers
of parameters in Model1, Model2, and Model3 are 18, 30, and 44
respectively, while the numbers of dependencies are 1, 3, and 5
respectively.) We see that to accurately estimate a model the data
size required is as large as 100 times the number of parameters.
Since a class-based model tends to have more than 100 parameters
usually, the current data size available in the Penn Tree Bank (see
Table 2) is not enough for accurate estimation of the dependencies
within case frames of most verbs.

6 The KL distance (KL divergence or relative entropy), which is
widely used in information theory and statistics, is a measure of
'distance' between two distributions (Cover and Thomas, 1991). It is
always non-negative and is zero if and only if the two distributions
are identical, but is asymmetric and hence not a metric (the usual
notion of distance).
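The properties of the KL distance noted in the footnote can be checked numerically with the following sketch. It assumes two distributions given as probability lists over the same finite set of outcomes, with the second non-zero wherever the first is; the function name and example values are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL distance D(p || q) in bits between two discrete distributions."""
    return sum(p_x * math.log2(p_x / q_x)
               for p_x, q_x in zip(p, q) if p_x > 0)


true_model = [0.5, 0.5]
estimate = [0.9, 0.1]

print(kl_divergence(true_model, true_model))  # 0.0
print(kl_divergence(true_model, estimate))
print(kl_divergence(estimate, true_model))
```

The first value is zero because the two distributions coincide, and the last two differ from each other, which illustrates the asymmetry.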

5 Conclusions 

We conclude this paper with the following re- 
marks. 

1. The primary contribution of research re- 
ported in this paper is that we have proposed 
a method of learning dependencies between 
case frame slots, which is theoretically sound 
and efficient, thus providing an effective tool 
for acquiring case dependency information. 

2. For the slot-based model, sometimes case 
slots are found to be dependent. Experimen- 
tal results demonstrate that using the depen- 
dency information, when dependency does 
exist, structural disambiguation results can 
be improved. 

3. For the word-based or class-based models, 
case slots are judged independent, with the 
data size currently available in the Penn Tree 
Bank. This empirical finding verifies the in- 
dependence assumption widely made in prac- 
tice in statistical natural language processing. 

We proposed to use dependency forests to repre- 
sent case frame patterns. It is possible that more 
complicated probabilistic dependency graphs like 
Bayesian networks would be more appropriate for 
representing case frame patterns. This would re- 
quire even more data and thus the problem of 
how to collect sufficient data would be a crucial 
issue, in addition to the methodology of learning 
case frame patterns as probabilistic dependency 
graphs. Finally the problem of how to determine 
obligatory/optional cases based on dependencies 
acquired from data should also be addressed. 
Table 6: Verbs and their dependent case slots

Head       Dependent slots   Example
acquire    from for          acquire from corp. for million
apply      for to            apply to commission for permission
boost      from to           boost from 1% to 2%
climb      from to           climb from million to million
climb      in to             climb to million in segment
cut        from to           cut from 700 to 200
decline    from to           decline from billion to billion
end        at on             end at 95 on screen
fall       from to           fall from million to million
grow       from to           grow from million to million
improve    from to           improve from 10% to 50%
increase   from to           increase from million to million
jump       from to           jump from yen to yen
move       from to           move from New York to Atlanta
open       at for            open for trading at yen
raise      from to           raise from 5% to 10%
reduce     from to           reduce from 5% to 1%
rise       from to           rise from billion to billion
sell       to for            sell to bakery for amount
shift      from to           shift from stock to bond
soar       from to           soar from 10% to 15%
think      of as             think of this as thing

